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Abstract 

Over  recent  years.  Adaptive  Mesh  Refinement  (AMR)  algorithms  which  dynamically  match  the 
local  resolution  of  the  computational  grid  to  the  numerical  solution  being  sought  have  emerged  as 
powerful  tools  for  solving  problems  that  contain  disparate  length  and  time  scales.  In  particular, 
several  workers  have  demonstrated  the  effectiveness  uf  employing  an  adaptive,  block-structured 
hierarchical  grid  system  for  simulations  of  complex  shock  wave  phenomena.  Unfortunately,  from 
the  parallel  algorithm  developer’s  viewpoint,  this  class  of  scheme  is  quite  involved;  these  schema 
cannot  be  distilled  down  to  a  small  kernel  upon  which  various  parallelizing  strategies  may  be  tested. 
However,  because  of  their  block-structured  nature  such  schema  are  inherently  parallel,  so  all  is  not 
lost.  In  this  paper  we  describe  the  method  by  which  Quirk  5  AMR  algorithm  has  been  parallelized. 
This  method  is  built  upon  just  a  few  simple  m^age  passing  routine  and  so  it  may  be  implemented 
across  a  broad  class  of  MIMD  machine.  Moreover,  the  method  of  parallelization  is  such  that  the 
original  serial  code  is  left  virtually  intact,  and  so  we  are  left  with  just  a  single  product  to  support. 
The  importance  of  this  fact  should  not  be  underestimated  given  the  size  and  complexity  of  the 
original  algorithm. 

While  the  parallel  version  currently  lacks  some  of  the  advanced  features  of  the  serial  version, 
H  is  sufficiently  mature  that  it  can  be  used  routinely  to  perform  very  large  scale  simulations  of 
detonation  phenomena  using  workstation  clusters.  Hence  the  parallel  algorithm  has  progressed 
beyond  the  level  of  being  solely  an  exercise  in  computer  science  to  become  a  powerful  research  tool 
for  investigating  fluid  phenomena.  Finally,  although  it  will  be  seen  that  we  have  produced  a  fair 
amount  of  paraphernalia  to  parallelize  just  a  single  algorilhin.  it  should  be  appreciated  that  the 
AMR  algorithm  is  itself  .sufficienily  general  to  be  applicable  to  a  large  class  of  problems.  .And  .so 
the  method  described  here  could  be  Icgitirn.'iteK  construed  as  being  a  template  for  parallelizing 
block-structured,  .adaptive  grid  algorilhms. 
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.Science  and  Engineering  (ICASE),  N.A.SA  Langley  Research  Center,  HaHipton,  VA  .t3681. 


mm  QUALITY  lOTPEUreDIB 


1  Introduction 


Despite  the  enormous  potential  power  offered  by  parallel  computers,  it  is  worth  illustrating 
that  brute  force  calculations  are  unlikely  to  be  of  much  use  for  solving  problems  that  contain 
disparate  physical  length  scales.  Consider  the  following  example  taken  from  the  study  of 
detonation  waves. 

The  usefulness  of  explosive  materials  stems  from  their  ability  to  rapidly  convert  chemical 
energy  into  heat  energy.  For  example,  a  good  solid  explosive  converts  energy  at  a  rate  of  the 
order  10^®  watts  per  square  centimetre  of  its  detonation  front.  Thus,  as  noted  by  Fickett 
and  Davis[7],  a  20  m  square  detonation  wave  operates  at  a  power  level  equal  to  all  the  power 
the  earth  receives  from  the  sun!  For  a  given  explosive,  the  rate  of  energy  release  essentially 
depends  on  the  speed  with  which  a  detonation  wave  is  propagated.  Traditionally,  detonatior 
speeds  are  determined  from  experiment.  For  solid  explosives,  a  cylindrical  charge  known 
as  a  rate-stick  is  ignited  at  one  end,  and  the  propagation  speed  is  measured  at  the  other 
end.  It  is  assumed  that  the  length  of  the  stick  is  sufficient  to  allow  the  detonation  front 
to  reach  its  nominally  steady  speed.  Note  that  the  leading  part  of  a  detonation  front  is  a 
strong  shock  wave.  As  this  wave  propagates,  so  che  explosive  materiad  is  compressed  and 
thus  heats  up.  This  raise  in  temperature  triggers  a  chemical  reaction  which  releases  large 
amounts  of  energy  in  the  form  of  heat.  This  enerp'  release  provides  motive  force  for  the 
shock,  and  a  balance  is  reached  such  that  the  chemical  reaction  supports  a  nominally  steady 
speed  of  shock  propagation. 

The  simulation  of  a  rate-stick  experiment  represents  a  formidable  challenge.  Since  the 
chemical  reaction  drives  the  shock  wave,  the  simulation  must  be  able  to  resolve  the  reaction 
zone  accurately.  Results  for  model  problems  suggest  that  at  least  10  mesh  cells  are  needed 
across  the  width  of  the  reaction  zone.  Now  for  certain  types  of  solid  explosive  the  reaction 
zone  may  be  only  0.02  mm  in  thickness,  in  which  case  the  mesh  spacing  within  the  reaction 
zone  must  be  no  larger  than  0.002  mm.  Given  that  a  rate-stick  might  be  100  mm  in  length 
and  100  mm  in  diameter,  some  1.2.5  x  10®  cells  would  be  required  for  an  axisynunetric  flow 
calculation  on  a  uniform  mesh.  From  the  point  of  view  of  nmnerica!  accuracy,  it  is  unlikely 
that  the  detonation  front  could  be  propagated  by  inore  than  one  mesh  cell  per  time  step. 
Consequently  it  would  take  some  5  k  10’  time  steps  for  the  detonation  to  travel  the  full 
length  of  the  rate-stick.  Therefore  the  total  workload  for  the  simulation  would  be  of  the 
order  of  6.25  k  lO’®  cell  updates.  .Such  a  calculation  would  be  absurd.  A  I  Gflop  computer 
might  be  capable  of  10®  luesh  updates  per  second,  in  which  case  the  calculation  would  take 
723  days  to  run!  Clearly,  to  make  such  a  siinnlation  viable,  something  other  than  a  large 
couiputer  Is  required,  hence  the  need  for  adaptive  mesh  refinement. 

.adaptive  mesh  schemes  attempt  lo  match  dynamic.'iliy  the  local  resolution  of  the  com¬ 
putational  grid  10  the  requirements  of  .he  evolving  flow  solution.  Thus  very  fine  mesh  cells 
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are  restricted  to  those  regions  whore  they  arc  needed,  and  elsewhere  the  computational 
grid  may  be  quite  coarse.  Such  a  strategy'  can  dramatically  reduce  the  computational  effort 
required  to  perform  simulations  of  problems  that  contain  disparate  scales.  Returning  to  our 
detonation  simulation,  if  the  fine  mesh  cells  yverc  restricted  to  the  vicinity  of  the  react!  >n 
zone,  only  about  2.5  x  10"’  cells  would  be  Tcqulred  for  the  simulation.  In  yvhich  case  the 
simulation  would  only  require  of  the  order  of  1.2.5  x  lO’®  cell  updates.  Therefore,  whereas 
the  uniform  mesh  simulation  might  lake  723  days  to  run,  the  adaptive  mesh  simulation 
yvould  take  jiisl  208  iiiinutes! 

Because  the  potential  savings  are  so  large,  the  adoption  of  almost  any  form  of  mesh 
adaption,  no  matter  how  naive,  v.'ill  pay  some  dividend.  Coriscc|uently,  a  wide  variety'  of 
strategies  have  been  utilizcd[II].  For  the  simulation  of  compIo.x  shock  yviivo  phenoincna, 
hoyyever,  the  .4MR  algorithm  first  developed  by  Bergcrfl,  2],  and  later  by  Qulrk[l3],  and  by 
FischerfS],  has  proved  to  be  particularly  effective.  Despite  its  effectiveness,  it  is  clear  that 
the  long  term  usefulness  of  the  AMR  algorithm  yvill  depend  on  the  c,xteiit  to  which  it  can 
exploit  parallel  computing  engines.  The  aim  of  this  paper  is  to  show  that  although  the  AMR 
algorithm  is  quite  involved,  and  as  such  it  may  be  thought  to  be  an  unsuitable  candidate 
for  parallelization,  in  actuality  the  algorithm  has  sufficient  inherent  parallelism  so  as  to  be. 
a  good  candidate  for  running  on  MIMD  machines.  Before  proceeding  further  it  should  be 
acknowledged  that  Berger  and  Saltznian(3]  have  already  had  some  success  using  the  AMR 
algorithm  on  a  SIMD  machine.  But  since  this  architecture  necessitates  an  approach  very 
different  from  the  one  described  hers  their  work  will  not  be  considered  further. 

The  rest  of  this  paper  is  as  organized  follows.  In  Section  2  we  pr^nt  some  background 
information  for  the  .AMR  algorithm  together  with  some  of  the  implementation  details  for 
the  original  serial  algorithm  so  that  the  reader  might  better  understand  the  method  of 
parallelization  which  is  described  in  Section  3.  In  Section  4  we  present  two  set  of  rwults 
which  demonstrate  the  worth  of  the  new  parallel  version  of  AMR.  The  first  set  of  results  will 
be  of  more  interest  to  the  computer  scientist,  for  it  deals  with  issues  such  as  how  well  does 
the  algorithm  scale.  The  second  set  of  results  will  be  of  more  interest  to  the  applications 
scientist,  since  it  shows  how  the  parallel  algorithm  might  be  used  in  earnest;  results  are 
presented  from  a  study  of  the  Mach  reflection  of  a  detonation  wave  by  a  ramp.  Lastly,  in 
Section  5  we  present  some  conclusions  that  we  have  drawn  from  this  work. 

2  The  AMR  Algorithm  -  A  Primer 

The  AMR  algorithm  is  a  general  purpose  mesh  refinement  scheme  that  has  priiiiariiy  been 
used  for  studying  shock  wave  phenomena[2, 6, 14].  The  purpose  of  this  section  is  to  provide 
a  primer  for  those  reader.s  who  are  unfamiliar  with  the  algorithm  in  order  that  they  might 
appreciate  the  issues  that  shaped  our  method  of  parallelization.  It  will  soon  become  obvious 


2 


that  the  AMR  algorithm  is  quite  involved,  and  so  here  we  can  do  little  more  than  describe 
what  constitutes  the  algorithm.  No  real  attempt  is  made  to  describe  why  the  algorithm 
chooses  to  do  things  a  certain  way,  nor  how  !t  actually  performs  certain  tasks.  Those 
interested  in  the  full  details  are  strongly  recommended  to  read  Quirk’s  thesis[13].  The 
exposition  given  here  is  broken  down  into  four  parts.  First,  we  dscribe  the  grid  structure 
used  by  the  algorithm,  for  all  other  aspects  of  the  scheme  stem  from  this  structure.  Second, 
we  outline  the  process  which  integrates  the  discretized  flow  solution  contained  by  the  grid 
structure.  Third,  we  outline  the  process  whereby  the  grid  dynamically  adapts  to  the  evolving 
flow  solution,  and  finally,  we  close  this  primer  by  presenting  some  pseudo-code  fragments 
which  show  how  the  .AMR  algorithm  is  organized. 


2.1  Grid  Structure 

The  AMR  algorithm  employs  a  hierarchical  structure  of  embedded  meshes  to  discretize  the 
flow  domain.  The  bottom  of  the  hierarchy,  level  0,  consists  of  a  set  of  coarse  mesh  patches 
which  delineate  the  flow  domain.  Each  of  these  mesh  patchy  forms  a  logic^y  rectangular 
unit  of  cells.  These  patchy  are  restricted  such  that  it  is  possible  to  reference  all  their  cells 
by  a  single  (i,j)  co-ordinate  system,  Cq,  as  shown  in  Figure  1.  This  restriction  ensures 
that  there  is  continuity  of  grid  lines  between  neighbouring  patches  and  that  if  two  patches 
overlap  one  another  then  the  r^ons  of  overlap  are  identical.  Th^  m«h  patches  form  the 
effective  grid  at  level  0,  Go,  and  we  identify  the  patch  by  Go,k^  Usually  the  terms  mesh 
and  grid  are  synonymous,  but  throughout  this  work  we  reserve  the  term  mesh  for  a  single 
logically  rectangular  patch  and  the  term  grid  for  a  collection  of  such  patches. 


Mesh  Patch 


Logica!  Coordinate 
Systes 


Figure  1:  .All  rawhes  are  fixed  relative  to  a  logical  co-ordinate  system. 


The  flow  domain  may  be  refined  locally  by  embedding  finer  m^h  patches  into  the  coarse 
grid  at  level  0  to  form  the  next  grid  level  within  the  hierarchy,  (Ji-  These  embedded  patches 
are  formed  by  linearly  sub-dividing  rectangular  groups  of  coarse  cells.  The  choice  for  the 
number  of  sub-divisions  along  the  edges  of  a  coarse  cell  is  arbitrary,  but  it  must  be  the 
same  for  every  coarse  cell  that  Is  refined.  This  restriction  enables  every  mesh  cell  contained 
at  level  1  to  be  referenced  by  its  own  (i.j)  co-ordinate  system,  Cf  .  In  their  turn,  these 
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embedded  patches  may  contain  even  finer  embedded  patches  which  form  the  grid  G-j.  i  his 
process  of  refinement  may  be  repeated  as  desired  up  to  some  level  Imax-  The  grids  at 
different  levels  within  the  hierarchy  co-exist,  for  n.nderlying  an  embedded  fine  grid  there  is 
a  complete  coarse  grid  and  a  complete  coarse  field  solution,  see  Figure  2.  Note  that  the 
discretized  flow  solution  is  taken  to  be  a  cell-centred  projection  of  the  true  solution. 


Grid  and  density  contours  for  all  3  grid  levels 


Grid  and  density  rontoilrs  for  just  the  lower  2  grid  levels 


Figiirt*  2:  Coarse  grids  exist  Ijerioatli  fine  grids. 

In  order  to  sperify  an  arliilrary  grid  structure,  G’,  it  is  iicce.ssary  to  .supply  the  spati.al 
reniieiiient  factors  rli  and  r./|  for  each  grid  level,  1,  logcllier  with  the  oxlciit  of  c.ach  nicsli 
imlcli,  G'u,  using  Cj  co-ordinaii's.  The  extent  of  a  patch  is  just  given  hy  the  co-ordinalcs 
of  its  lower-loft  and  iippor-right  corners.  All  other  details  for  the  grid  slruclnrc  can  be 
gloaned  from  this  ha.sic  ilala.  For  example,  since  a  simple  relationsliip  exists  hclwcoii  llie 
co-ordinate  sy.steins  Cf  and  Cf_., 


*1-1 


jil-l) 

rli 


1 


and 


Jl-i  = 


(jf-l)  . 
rJ| 


1, 


it  is  possible  to  determine  which  patches  at  level  I  —  I  underpin  a  given  patch  at  level  I, 


4 


For  convenience  we  chose  to  store  the  following  variables  for  the  original  serial  imple¬ 
mentation  of  the  grid  structure, 

LM.4X  ffjiar?  maximum  grid  level. 

For  each  grid  Git 

NGA(L)  nGlj  number  of  grids  contained  by  Gt. 

mP(L)  Gpt,  grid  index  pointer. 

rI(L)  r/|.  Spatial  refinement  factor  in  I  direction. 

rJfL)  rJ|.  Spatial  refinement  factor  in  J  direction. 

For  each  ra^b  patch  Gi,k' 

GRD  =  GP(L)-hK 

LGRID(GRD)  Grid  level  for  the  mesh  Gi,k. 

LMX(GRD)  IMi^,  Width  of  mesh  Gi,k. 

JMX(GRD)  Height  of  mesh  G/,*. 

JNf(GRD)  O/jfe,  extent  of  mesh  Gt^k  using  C?  co-ordinates. 
JSf(GRD) 

IEf(GRD) 

IWf(GRD) 

JNc(GRD)  O^,  extent  of  mesh  G‘,k  nslug  co-ordinates. 

JSc(GRD) 

lEc(GRD) 

IWc(GRD) 

Note  that  the  storage  overheads  are  quite  small,  just  11  variables  are  used  for  each  patch. 
.4Iso  note  that  we  have  assigned  a  unique  index  to  each  me-h  patch;  the  pid  index  pointer 
Gpi  satislira  the  recurrence  relation,  Gpi+i  =  Gpi  +  nGu.  with  Gpo  set  to  zero.  Considertng 
that  an  average  patch  might  contain  upwards  of  1000  cells,  the  overhead  per  cell  is  negligible. 
.4  pair  of  nwted  DO  loops  is  all  that  is  required  to  run  through  the  grid  structure  and  operate 
on  every  m^h. 

DO  L=0,LM.4X 
DO  K=1,NG.4(L) 

GRD  =  GP(L)+K 

Operate  on  niMh  G|,* 

END  DO 
END  DO 

The  discretized  flow  solution  is  stored  In  a  series  of  large  lists  or  heaps;  a  separate  heap  is 
used  for  each  variable.  Each  heap  contains  a  contiguous  set  of  blocks,  one  block  for  each 


mesh,  and  each  block  consists  of  a  contiguous  set  of  storage  elernejits  with  one  elciiicni  for 
each  cell  of  the  mesh.  It  is  convenient  to  view  a  mesh  patch  as  being  surrounded  by  a  border 
of  dummy  cells,  two  cells  deep.  Thus  (IM  +  4)  x  (JM  +  4)  storage  elements  are  required  for 
each  mesh  patch.  The  head  of  each  block  is  found  by  indirection  through  the  list,  H2PTR, 
thus  the  information  for  the  tj**'  cell  of  the  grid,  ORD,  would  be  located  at, 

H2PTR(GRD)  +  (I  +  2)  +  (J  +  1)  ♦  (IMX(GRD)  +  4). 

Note  that  the  cells  within  a  m«h  patch  arc  stored  by  rows.  The  following  code  fragment 
would  access  every  mwh  cell  in  the  data  structure;  the  subroutine  UNPACK.GRID  com¬ 
pute  the  location  pointer,  IJ,  for  the  cell  (1,1)  and  returns  the  stride  lengths  Ibnip  and 
Jbmp  for  the  specified  in^h. 

DO  L=0,LMAX 
DO  K=1,NGA(L) 

GRD  =  GP(L)+K 

CALL  UNPACIC^RID(GRD,l,I,IJ,Ibinp,Jbmp) 

DO  I=1,1MX(GRD) 

Ho  ^  U 

DO  J=MMX(GRD) 

H  ojntalos  a  pointer  to  the  data  stored 
for  the  ij^  ^11  of  Gijt- 

U  =  IJ  +Jbrap 
END  DO 
H  =  IJo+Ibmp 
END  DO 
END  DO 
END  DO 

Connectivity  information  which  is  needed  along  mwh  boundariw,  such  as  which  msh 
patchtt  abut  a  given  patch,  is  also  stored  using  linked  lists,  bat  owing  to  a  lack  of  spMe  no 
details  can  be  pven  here. 

2.2  Flow  Integratmn 

In  principle  any  cell-centred,  flux-based  scheme  developed  for  a  single  topol^cally  regular 
mah  can  form  the  basis  for  the  flow  int^ratlon  proews.  The  duintny  cells  which  surround 
each  mash  patch  are  the  key  to  this  flexibility.  They  cflectively  turn  cdl  interface  along 
mab  boundaries  into  Internal  interface.  Prior  to  integrating  a  grid,  the  dummy  cells 
for  every  mrah  patch  contained  by  the  grid  are  primed  with  data.  Each  mesh  patch  is 


then  processed  indepecdently  of  every  other  mesh  patch  by  some  user  supplied,  black-box 
Int^ator  that  never  actually  ses  a  boundary.  This  is  possible  because  the  data  used 
to  prime  the  dummy  cells  is  chosen  such  that  the  restiltant  numerical  fluxes  along  mesh 
boundaries  are  consistent  with  the  various  boundary  conditions  that  have  to  be  met. 

Consider  the  boundary  shown  in  Figure  3.  The  AMR  algorithm  refines  in 

time  as  well  as  space.  So  for  a  refinement  ratio  of  4  say,  a  fine  m^h  patch  wiH  be  integrated  4 
tim^  with  I  the  size  of  the  time  step  of  its  underlying  coarse  patch.  The  order  of  istepation 
is  sdways  from  coarse  to  fine,  thus  the  co^se  patch  flow  solution  may  be  interpolated  in 
spmre  and  time  to  provide  Dlrichlet  boundary  conditions  for  the  fine  patch.  For  multi-level 
calculations,  the  integrations  of  the  various  separate  grid  levels  are  recursively  interleaved 
so  as  to  minimize  the  time  span  over  which  interpolation  Is  required. 


i  4-  Aii. 


r-o-e-o 


FM 


Figure  3:  A  fim-aiarse  internal  boundary. 

Only  two  other  typ«  of  m^h  boundary  mdst:  fine-fim  boundaries,  where  the  dummy 
cells  of  one  m«h  patch  exactly  overlap  the  mwh  cells  of  another  patch  at  the  same  grid 
level,  in  which  caw  the  appropriate  data  for  the  dummy  cdls  is  simply  shoveled  directly 
from  the  relevant  mi^h  cdls;  txtemal  boundariw,  where  the  data  for  the  dummy  cells  can 
be  inferred  locally,  for  mcample,  at  a  solid  wall  a  simple  reflection  procedure  is  applied  in 
the  usntf  manner. 

The  only  other  operators  that  form  part  of  the  AMR  flow  int^ration  process  are  a 
rwtriction  operator  which  projects  the  flow  solution  rontalned  by  a  fine  mesh  patch  on  to 
its  underlying  coarse  m^h  patch;  this  is  required  for  consistency  purposes.  .And  a  fixup 


operator  which  Is  applied  along  a  fine^cmrse  bonndary  whenever  a  conservative  intcgi.aion 
process  is  required.  Essentially  the  fixup  operator  inodifiw  the  updated  coarse  cell  solutions 
llong  a  fim-coarse  boundary  using  the  cumulative  fiux«  across  the  boundary  as  seen  by 
the  fine  -wtch  during  its  sub-cycle  of  smaller  int^ration  time  steps. 

2i3  Elements  of  the  Adaption  Process 

insider  a  planar  shock  wave  travelling  down  a  duct,  from  left  to  right.' Suppose  a  coarse 
grid,  Gq^  is  used  to  discretiae  the  duct  and  further  suppose  that  an  embedded  ^d.  Git 
which  is  finer  than  Go  covers  the  vicinity  of  the  shock,  see  Figure  4  (a).  Now,  if  the  flow 
solution  on  this  grid  structure  is  int^rated  forward  in  time,  sooner  or  later  the  shock  will 
move  to  within  one  mesh  ceU  of  the  right-hand  edge  of  the  grid  G'l,  see  Figure  4  (b).  The 
fhock  is  about  to  Mn  off  the  edge  of  the  embedded  rash.  At  the  very  Itest  this  act  will 
«use  the  shock  to  Sfticar  to  its  natural  width  on  the  coarse  grid  thus  lowering  the  resolution 
of  the  simiilationi  But  if  the  shock  is  strong,  il  wlU  airo  introduce  spurious  ^lUatiOiis  into 
the  flow  solution,  the  AMR  algorithm  avoids  such  problems  by  dynamically  adapting  the 
grid  strnclure  to  the  evolving  flow.  Here,  the  adaption  proctts  ^nlts  !n  the  emb^ded 
grid,  Git  along  the  roaise  grid,  G©,  m  as  to  keep  pace  with  the  moving  shock  front* 

The  adaption  ptoces  may  be  split  Imisely  Into  three  tasks,  first,  ^ven  a  grid  structure 
and  flow  solutioei  t^ons  of  Interest  are  Identified.  Th^  re^ns  wUI  be  refined,  that  is 
they  will  be  covered  by  an  embedded  grid.  For  our  duct  ^cample  the  adaption  proc^  looks 
at  the  solatton  on  the  grid  Gq.  Uslug  tome  ad  hoc  monitor  function  such  as  the  local  density 
gradient,  the  ceDs  In  the  vicnity  of  the  shock  front  ate  flawed  for  refinement  as  shown  in 


e  clustere  so 


coarse,  roiiowing  ine  auapiion  oi  a  nne  gna,  me  auaniion  proc«s  lor  me  coarse 
next  level  down  must  crisure  that  any  new  coarse  grid  that  inay  be  produccil 
Is  the  finer  grid.  This  job  of  ensuring  “proper  nesting’'  can  only  be  done  If  ihe 
order  of  adaption  is  from  fine  to  coarw. 


Procedure  Adapt(/i  ) 

Initialise.^ 

for  I  =  l„,ax  —  1  down  to  l\  { 

Set.ReflneJF'Iags(/) 

Easure_Ptoper_Nesting(7) 

Gluster(/) 

} 

Transfer  .Solution 
Transfer  .Data.Structure 
BuiIcl.Connectivity(/i ) 

End  Procedure 

S  Method  of  Parallelization 

It  should  now  be  clear  that  the  AMR  algorithm  cannot  be  distilled  down  to  a  small  kernel 
upon  which  one  can  simply  try  out  various  parallelizing  strategies.  And  so  it  might  appear 
that  the  algorithm  is  an  unsuitable  candidate  for  parallelization.  Fortunately,  however, 
the  algorithm  exhibits  a  natural  coarse  grain  parallelism  in  that  individual  mesh  patches 
may  be  processed  largely  in  isolation  from  one  another.  We  have  concentrated  our  efforts 
on  exploiting  this  coarse  grain  parallelism  using  a  message  passing  paradigm.  Given  this 
strategy,  one  might  assume  that  the  resultant  parallel  implementation  would  not  scale  well 
for  large  numbers  of  processors,  but  before  reaching  such  a  conclusion  the  following  two 
points  should  be  considered  carefully. 

Firstly,  by  far  the  bulk  of  the  computational  effort  for  the  AMR  algorithm  is  spent  not 
on  complex  tasks  such  as  the  adaption  process  but  on  the  simple  task,  logically  speaking, 
of  integrating  an  isolated  mesh  patch.  For  the  explicit  integration  methods  that  are  cur¬ 
rently  in  vogue  for  simulating  shock  wave  phenomena,  this  computationally  intensive  task 
exhibits  much  fine  grain  parallelism  that  could  reasonably  be  expected  to  utilize  additional 
processors.  In  view  of  this,  the  ideal  parallel  computing  engine  for  the  AMR  algorithm 
would  probably  be  a  hybrid  MIM  D/SIM  D  machine.  The  orchestration  of  the  algorithm 
would  be  performed  using  our  message  passing  paradigm  on  MIMD  processors,  and  the  low 
level  “number  crunching”  would  be  performed  locally  on  shared  memory,  SIMD  processors 
which  e.xecuted  code  produced  by  a  “smart”  compiler.  Interestingly,  such  hybrid  machines 
are  already  beginning  to  appear  in  the  market  place,  an  example  being  the  CM-5  which  is 
produced  by  Thinking  Machines  Corporation [16]. 

Secondly,  it  should  be  appreciated  that  the  .4 MR  algorithm  is  designed  for  performing 
very  detailed  numerical  simulations;  thus  it  is  safe  to  assume  that  there  are  always  a  large 
number  of  mesh  patches  to  be  processed.  For  example,  the  not  overly  large  problem  shown 
in  Figure  8  contains  318  patches.  .411  in  all,  we  feel  that  for  practical  purposes  our  scheme 


will  not  run  into  scaling  difficulties,  especially  since  economic  strictures  will,  iiion  ulten 
than  not,  severely  limit  the  numbers  of  processors  that  are  available. 

All  the  basic  components  of  the  AMR  algorithm  stem  from  its  choice  of  grid  system.  In 
turn,  most  of  the  implementaiion  details  for  the  algorithm  stem  from  the  way  in  which  the 
grid  system  is  coded.  Therefore  at  the  outset  of  designing  the  parallel  implementation,  we 
decided  that  it  was  necessary  to  try  and  preserve,  as  far  as  was  possible,  the  grid  description 
employed  by  the  original  serial  version  so  £is  to  be  able  to  re-cych  large  portions  of  the 
existing  code.  It  later  transpired  that  given  a  layer  of  “parallel  machinery"  we  could  actually 
re-use  all  the  old  serial  code  in  a  new  SPMD  (Single  Program  Multiple  Data)  parallel  code. 
We  now  describe  this  machinery  in  some  detail.  First  we  give  some  of  the  implementation 
details  for  the  parallel  grid  structure  and  its  associated  data  storage  mechanism,  since  these 
fixed  the  ease  with  which  the  parallelization  was  accomplished.  Next  we  give  an  abstract 
description  for  the  message  passing  paradigm  which  underpins  our  parallel  implementation. 
Finally,  we  describe  how  the  new  parallel,  message  passing  AMR  algorithm  is  organized. 

3.1  Parallel  Grid  Structure 

In  essence,  the  serial  code  maintained  a  unique  identifier  for  each  mesh  patch;  thus  the  grid 
structure  could  be  described  using  a  set  of  tables  which  were  simply  indexed  using  the  mesh 
identifiers.  For  the  parallel  implementation  we  have  adopted  the  same  approach.  Firstly, 
since  we  restrict  ourselves  to  a  coarse  grain  parallelism,  we  can  postulate  that  a  mesh  patch 
need  only  ever  reside  on  a  single  processor.  After  all,  as  was  already  the  case  with  the  serial 
implementation,  a  large  patch  can  always  be  split  up  into  two  or  more  smaller  patches. 
Given  the  number  of  mesh  patches  on  each  processing  node,  it  is  possible  to  construct  a  set 
of  unique  identifiers  for  the  mesh  patches  in  a  grid  structure  distributed  across  one  or  more 
processors.  Thus  it  is  possible  to  maintain  a  reversible  mapping  between  some  global  mesli 
identifier  and  the  information  pair  consisting  of:  the  identity  of  the  node  which  owns  the 
patch  referenced  by  the  global  identifier  and  the  local  index  by  which  this  node  references 
the  patch; 

global  mesh  identifier  <;=^  (processor  id,  local  mesh  identifier). 

Secondly,  given  the  .small  amount  of  information  that  is  required  to  define  the  grid  structure, 
it  is  not  unreasonable  to  store  a  local  copy  of  the  abo\’c  mapping,  together  with  the  global 
grid  tables,  on  each  proco.ssing  node.  For  our  hiiplementation  this  storage  overhead  amount.s 
to  just  .52  bytes  per  mesh  patch;  although  small,  this  overhead  could  bo  reduced  to  about  20 
bytes,  since  for  conv’enicncc  we  store  certain  derived  variables.  Now  a  larged  sized  simulation 
might  consist  of  1024  mc.sh  palches  with  an  average  sized  patch  of  .50  x  .50  cells.  If  this 
problem  were  spread  arros.5  128  nodes,  the  local  copy  of  the  grid  structuro  would  amount  to 
less  than  6%  of  the  memory  required  to  store  the  flow  solution  contained  on  a  specific  node 


(a  typical  user-suppUed  flow  solver  might  store  between  8  and  10  double  precision  variables 
for  each  mesh  cell).  Obviously  having  either  a  smaller  average  grid  size  or  a  larger  number 
of  processors  increases  this  storage  overhead,  but  the  strategy  of  keeping  a  local  copy  of  the 
global  grid  structure  remains  attractive.  However,  if  at  some  future  date  the  overhead  ever 
became  prohibitively  large,  it  would  be  possible  to  make  use  of  the  information  contained 
in  the  mesh  extents  so  as  to  partition  each  of  the  global  grid  tables  into  some  form  of 
distributed  description  table. 

Since  the  grid  data  structure  for  the  parallel  implementation  is  effectively  the  same  as 
that  for  the  old  serial  implementation,  all  of  the  original  coding  may  be  reused.  For  example, 
the  code  fragment  for  accessing  each  cell  of  the  grid  structure,  given  on  page  6,  now  has  the 
form: 


DO  L=0,LMAX 
DO  K=1,NGA(L) 

GRD  =  GP(L)+K 
i  fjnodej>wns.grid{GKD) 

CALL  UNPACK.GRlD(GRD,l,l,IJ,Ibmp,Jbrap) 

DO  I=1,IMX(GRD) 

IJo  =  IJ 

DO  J=:l,JMX(GRD) 

IJ  contains  a  pointer  to  the  data  stored 
for  the  tj***  cell  of  Gi,k> 

IJ  =  IJ  +Jbmp 
END  DO 
IJ  =  IJo+Ibmp 
END  DO 
endjiodeJf 
END  DO 
END  DO 

The  directives  i fjnodejowns.grid  and  endjwdeJ f  have  simply  been  inserted  into  the  orig¬ 
inal  code  so  as  to  ensure  that  an  individual  node  only  processes  those  mesh  patches  for 
which  it  owns  the  flow  solution.  Note  that  the  variables  LMAX,  NGA,  GP,  IMX  and  JMX 
form  part  of  the  global  grid  tables  and  are  theiefoie  stored  locally  on  each  node.  It  should 
also  be  noted  that  we  only  maintain  one  code  for  both  the  serial  and  the  parallel  versions  of 
the  AMR  algorithm.  For  the  serial  code,  the  above  directives  are  simply  mapped  to  blank 
lines  during  a  pre-processing  phase  of  the  compilation,  while  for  the  parallel  code  they  are 
substituted  by  a  FORTR.4N  “IF  ...THEN”  construct  which  tests  to  see  if  the  mesh  patch 
GRD  is  located  on  the  node  e.xccuting  the  code. 
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Another  reason  as  to  why  the  above  code  fragment  may  be  so  easily  pnralli  lies 
buried  within  the  sukrontine  I'NPACK.GRID.  As  was  described  in  Section  2.1,  Jill  flow 
variables  are  stored  using  a  number  of  heaps,  one  lieap  for  each  variable,  with  the  data 
associated  with  the  tj**'  cell  of  the  grid,  GRD,  being  stored  at  the  location 

H2PTR(GRD)  +  (I  +  2)  +  (J  +  1)  ♦  (LMX(GRD)  +  4). 

In  the  parallel  implementation  this  location  is  now  given  by 

H2PTR{/ina«(GRD))  +  (I  +  2)  +  (J  +  1)  *  (IMX(GRD)  +  4), 

where  the  macro  lindex  returns  the  local  mesh  index  corresponding  to  the  global  mesh 
index,  GRD.  Thus  each  node  maintains  its  own  heap  data  storage  in  exactly  the  same  way 
as  the  serial  code,  but  it  only  stores  data  for  those  patches  for  which  it  is  deemed  responsible. 

Lastly,  as  before,  we  can  recover  the  old  serial  code  from  the  new  parallel  implementation 
by  simply  m.'  ’  'ng  the  appropriate  substitutions  for  the  macro  Hndex  at  compile  time. 

3.2  A  Message  Passing  Paradigm 

Our  parallel  implementation  for  the  basic  AMR  algorithm  is  based  upon  a  message  passing 
paradigm.  At  various  junctures  during  run  time,  the  access  of  data  is  identified  as  being 
either  local  or  non-local.  For  any  access  which  is  non-iocal,  that  is  the  required  data  iics 
off-processor,  the  appropriate  inter  node  communication  tasks  are  first  scheduled  and  then 
later  executed  as  a  series  of  sends  and  receives.  In  this  section  we  describe  how  these 
inter  node  communications  are  orchestrated.  Our  message  passing  machinery  relies  only 
on  a  liirdted  functionality  being  available  at  the  system  level;  the  high  level  procedures 
described  here  are  supplemented  by  a  small  user  supplied  library  (typically  less  than  two 
hundred  lines  of  code)  which  is  dependent  on  the  target  platform.  Thus  %ve  have  been  able 
to  ensure  the  portability  of  the  AMR  algorithm.  To  date,  we  liave  ported  the  algorithm 
to  a  dedicu,ted  parallel  computer  (a  32  node  iPSC/860  machine)  using  its  native  message 
passing  routines  [9],  and  to  the  following  workstation  cluster  environments:  PVM  (Parallel 
Virtual  Machine)  [4],  and  APPL  (.Application  Portable  Parallel  Library)  [12], 

Given  our  assumption  that  a  mesh  patch  resides  on  just  one  processor,  there  arc  ordy  a 
few  key  tasks  within  the  AMR  algorithm  that  might  necessitate  accessing  non-local  data. 
The  most  visible  of  these  tasks  is  the  job  of  priming  the  dummy  cells  which  surround  each 
mesh  patch.  In  the  original  serial  implementation,  whenever  tlie  grid  structure  changed, 
the  inter  mesh  connectivity  was  recomputed  so  as  to  build  a  schedule  of  the  individual  data 
accesses  required  for  priming  the  dummy  cells  of  the  grid.  Since  we  store  a  local  copy  of 
the  grid  description  on  each  node,  this  procedure  remains  the  same  as  that  for  the  serial 
implementation.  Basically,  the  connectivity  information  is  found  by  comparirig  the  nicsli 
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extents  between  p.itches  on  consecutive  grid  levels.  Where  appropriate,  external  boundary 
inforination  lakes  precedence  over  fine-fine  information  wliich  in  turn  takes  precedence 
ox^x  fine-coarse  information.  Once  this  data  access  schedule  has  been  produced  it  may  be 
parsed  so  as  to  find  which  of  its  entries  involve  nca-local  data.  These  entries  are  collected 
and  stored  in  a  separate  schedule.  The  following  pseudo-code  illustrates  this  procedure;  it 
builds  the  connectivity  information  for  grid  levels  l\  to  Unax- 

Proesdnrs  Buiid_Cor»iectivity(/i ) 

for  /  =  /|  to  UiiaxX 

Reset_Bdy_Types(/) 

FIag_Coarse_Bdy(/) 

Flag.Fine3dy{/) 

SetJExternaLBdyf/) 

Check  J^esting(/) 

} 

Build  .Off  JProcessor_BC-Schedule(/i ) 

End  Procedure 

The  innards  of  the  five  procedures  within  the  “for”  loop  are  too  involved  to  be  described 
here,  but  they  are  identical  to  »hose  in  the  serial  implementation  except  for  the  addition 
of  some  if.nodc.owns.grid  directives  which  ensure  that  a  processor  only  works  on  those 
patches  which  it  owns;  we  remind  the  reader  that  full  details  of  the  algorithm  have 
been  given  else.whcre[13]. 

The  procedure  Build_Off_Processor_BC_ScheduIe  oversees  tlsc  production  of  the 
schedule  of  tasks  which  involve  acce.ssing  noielocal  data.  The  resulting  schedule  simply 
consists  of  a  set  of  rccpiests  for  information;  each  request  identifies  a  processing  node  together 
with  a  portion  of  a  mesh  owned  liy  that  processor  which  contains  the  desired  data. 

Procedure  Build_Off_Processor-BC_Schedule(/i ) 

for  I  =  l\  to  Unax  { 

Reset_Requests(/) 
for  k  =  I  to  iiG}  { 

i  f^nodc.owns,gr{d{Gpi  +  k) 

BuiId_Requests(C7p/  +  k) 
cnd.nodc.if 

} 

} 

for  I  —  f|  to  Unax  i 

Transmlt,Requests(/) 

} 

End  Procedure 


A  global  synchronization  is  performed  in  the  procedure  Transmit.Requests,  following 
which,  each  separate  request  is  sent  to  the  specific  processing  node  that  will  eventually 
supply  the  oflf-prccessor  data.  When  all  the  requests  have  been  transmitted,  each  processing 
node  has  two  local  lists.  For  each  grid  level,  I,  the  first  list  contains  the  details  for  nSi  itejus 
of  data  that  the  node  must  send  out  when  the  dummy  cells  for  Gt  are  being  primed,  while 
the  second  fist  contains  the  details  for  the  nRt  items  of  data  that  the  node  is  expecting  to 
receive  during  the  priming  process.  Given  these  send  and  receive  schedules,  the  procedure 
which  oversees  the  priming  of  the  dummy  cells  for  the  grid,  (?/,  has  the  fonn: 

Procedure  Set_BC(.',  Nt) 
for  fc  =  1  to  nGs  { 

ifjiode..owns.grid{Gpt  +  k) 

Set_OnJ*roce88orJBCfiVt,  Gpt  +  k) 
endjiodeJf 

) 

Set_Dff_Processor-BC(l,  Nt) 

End  Procedure 

The  call  to  Set.OnJ*rocessor_BC  performs  all  the  data  movements  that  involve  just  local 
accesses,  therefore  tills  routine  is  the  same  as  that  for  the  serial  implementation,  while  the 
procedure  Set-Off  JProcessor_BC  performs  all  the  movements  that  involve  non-local  data. 
Note  that  the  integration,  sub-cycling  Index,  ATf,  is  required  so  that  the  correct  interpolation 
in  time  can  be  done  at  a  fine**xarse  boundary,  see  Section  2.2. 

Procedure  Sei.Off_ProceasorJ9C(l,  Nt) 

Synchronize  J'lodM 
for  item  =  I  to  nS|  { 

Pack3d8gBafi(ttem,  Nt) 

Node  =  6et_Node(  item) 

Snd.BCJM[sg(  iVode) 

} 

for  item  =  I  to  nRt  { 

Wait_For_BC_Msg 
Rcv_BC_Msg 
Unpack3f  sgBu^  N  t) 

} 

End  Procedure 

The  non-local  data  movements  take  place  as  follows.  First,  all  the  processing  nodes  are 
synchronized,  then  each  node  works  through  its  fist  of  messages  to  send.  Each  separate 
n  essage  is  packed  into  a  buffer  and  then  sent  to  the  appropriate  node  using  some  low-level, 
sys  em  dependent  routine.  Once  a  node  has  sent  out  all  its  messages,  it  is  ready  to  receive 
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any  incoming  messages  that  it  might  he  expecting.  The  order  in  which  the  messages  arrive 
is  unimportant.  A  node  knows  that  it  will  be  sent  nRi  messages,  so  it  simply  waits  for 
that  number  of  messages  to  arrive.  Having  received  a  mes.sage,  again  via  some  low-level, 
.system  dependent  routine,  it  simply  decodes  an  identifier  from  within  the  message  so  as 
to  determine  w'here  the  incoming,  off-processor  data  should  be  stored.  Note  that  a  similar 
process  of  sends  and  receives  may  be  tised  in  the  other  tasks  (principally  the  procedures 
Project_Solution  and  Trai.sfer_Solution)  that  might  want  to  access  non-local  data. 

3.3  The  Parallel  AMR  Algorithm 

A.»art  from  a  couple  of  additions  made  to  the  adaption  process,  the  basic  organization  of 
the  parallel  ,4MR  algorithm  remains  the  same  as  that  for  the  serial  implementation.  We 
now  describe  these  additions,  but  first  we  simply  present  the  new  version  of  the  procedure 
Adapt,  c.f.  the  version  given  on  page  11. 

Procedure  Adapt(/| ) 

Initialize.G’ 

for  /  =  -  I  down  to  /i  { 

Set_Refine_Flags(/) 

Ensure  J*roper_Nestmg(f) 

Ciuster(l) 

Exchange.New_£xtentsf /) 

} 

Distrlbute_Grids(/i ) 

Transfer  .Solution 
Transfer-Data.Structure 
BuiId_Connectivity(i| ) 

End  Procedure 

As  has  been  described  previously,  the  clustering  process  produces  a  set  of  mesh  e.xtcnts 
which  describe  the  newly  adapted  gritl  G'«.  However,  for  the  parallel  implementation,  the  call 
Cluster(7)  will  only  produce  a  subset  of  the  mesh  extents  for  Gi,  since  a  node  only  proccs.ses 
those  patches  which  it  owns.  Therefore  following  the  call  to  Cluster,  it  is  necessary  to 
perform  a  global  exchange,  amongst  the  proce.ssors.  of  the  mesh  extents  found  locally.  This 
is  done  with  the  call  to  Exchange_New  JIxtents.  This  global  exchange  allows  each  node 
to  a.ssemble  the  global  grid  <le.srriplion  table  for  the  newly  adapted  grl<l,  G'/. 

Once  each  processor  has  the  global  de.scription  for  the  grid,  snllicient  information  is 
available  to  dcteriiiine  bow  the  nm^  mesh  palrhe.s  should  be  distributed  .so  as  to  achieve  a 
satisfactory  load-balance  amongst  the  different  prcjcessing  no<les:  thi.s  is  done  via  the  call 
Distribute.Grids.  .As  yet,  we  do  not  have  any  nniver.sal.  distribution  strategv  that  will 
work  well  in  all  ca.se.s.  J’istead.  we  have  simply  cl!0.sf-!!  one  of  .several  heuristic  strategies 


depending  upon  the  particular  flow  problem  that  we  are  solving.  The  question  of,  how 
should  one  distribute  the  mesh  patches,  remains  an  active  area  of  research.  However,  our 
algorithm  is  structured  such  that  different  distribution  strategies  may  be  swapped  in  and 
out  at  will.  Therefore,  it  matters  little  at  this  stage  that  we  have  no  universal  answer. 

4  Numerical  >  suits  and  Discussion 

We  now  present  two  sets  of  results  which  demonstrate  the  efficacy  of  our  parallel  imple¬ 
mentation  of  the  AMR  algorithm.  The  first  set  of  results  concentrates  on  computer  science 
issues,  such  as  how  well  does  the  algorithm  scale,  and  the  second  set  shows  how  the  algo¬ 
rithm  can  be  used  as  a  tool  to  provide  insight  into  basic  fluid  phenomena. 

4.1  Performance  Issues 

In  order  to  determine  the  performance  of  a  parallel  algorithm,  it  is  common  practice  to 
carry  out  at  least  one  of  the  following  two  studies.  For  varying  numbers  of  processors,  one 
monitors  the  time  taken  to  solve,  either  a  fixed-sized  problem,  or  a  problem  that  is  scaled 
in  such  a  way  that  the  workload  per  processor  remains  con.stant.  For  reasons  wltich  will 
follow,  neither  study  is  entirely  satisfactory  for  determining  the  performance  of  our  parallel 
.4MR  algorithm.  Nevertheless,  we  have  carried  out  both  studies,  the  results  of  which  we 
present  below.  But  first,  we  make  the  following  observations. 

Assuming  a  perfect  load-balance  amongst  processois,  effectively,  it  is  only  the  time  spent 
on  inter  node  communications  that  impacts  on  the  performance  of  our  algorithm.  On  the 
scale  of  things,  the  overhead  for  looping  over  every  mesh  patch  in  the  grid  structure  as  is 
done  by  the  code  fragment  on  page  13,  rather  than  looping  over  just  those  patches  that 
a  node  owns,  is  negligible.  Consequently,  for  a  given  number  of  processors,  the  efficiency 
of  the  algorithm  wiU  be  strongly  related  to  the  ratio  of  the  amount  of  computation  to  the 
amount  of  communication  Therein  lies  the  first  difficulty  for  assessing  the  performance  of 
our  algorithm-  The  AMR  algorithm  is  a  general  purpose  scheme  which  is  not  tied  to  any 
one  flow  solver.  Thus  for  a  fixed  grid  structure,  that  is  a  fi.xed  amount  of  communication, 
the  amount  of  computation  can  vary  tremendously  depending  upon  the  application.  Even 
for  a  fixed  application,  one  might  have  a  choice  of  several  different  numerical  schemes  to 
perform  the  “number  crunching".  For  example,  for  the  application  shown  in  the  next 
section,  depending  on  the  circumstances,  we  have  in  the  past  used  schemes  that  are  4-o 
times  more  expensive  than  the  one  used  for  this  paper.  Despite  such  uncertainties,  of  one 
thing  we  can  be  sure,  using  an  expensive  scheme  wiU  flatter  any  performance  figures  that 
are  measured.  For  that  reason,  all  the  performance  figures  given  in  this  section  are  for  the 
flow  solver  used  in  the  next  section.  Since  this  solver  is  one  of  the  least  expensive  we  use. 
the  performance  figures  given  below  may  be  considered  as  conservative  estimates. 


Returning  to  our  two  performance  studies,  it  is  well  known  that  a  study  in  which  the 
flow  problem  is  scaled,  so  as  to  keep  the  computational  load  per  processor  constant,  can 
hide  poor  performance  caused  by  either  a  large  serial  component  in  the  algorithm  under 
test,  or  a  large  system  overhead[10].  Of  the  two  studies,  however,  this  one  more  accurately 
reflects  the  way  in  which  our  algorithm  is  used  in  practice.  Again,  it  is  worth  emphasizing 
that  the  AMR  algorithm  is  designed  for  performing  very  detailed  simulations.  Indeed,  to 
this  end,  the  serial  Implementation  has  been  quite  successful  for  investigating  inert  shock 
wave  phenomena;  a  typical  simulation  might  take  a  day  to  run  on  a  high-end  workstation 
and  require  around  96  Mbytes  of  storage.  Therein  lies  a  second  difficulty  in  assessing  the 
performance  of  our  algorithm.  The  only  dedicated  parallel  machine  that  is  available  to 
us  has  a  paltry  8  Mbytes  of  memory  per  node,  of  which  maybe  only  half  is  available  for 
storage  once  the  operating  system  and  load  module  have  had  their  share.  Consequently 
it  is  impossible  to  run  a  fixed-sized  test  problem  of  any  consequence  on  such  a  machine. 
After  all,  a  problem  that  would  fit  into  4  Mbytes  might  take  only  30  minutes  to  run  on  a 
workstation.  What  is  the  merit  in  solving  this  in  under  30  seconds  on  a  massively  parallel 
machine,  if  the  results  are  going  to  take  a  day  to  analyze! 

When  investigating  fluid  phenomena  our  modus  opemndi,  in  common  with  many  of  our 
colleagues,  is  to  scale  the  problem  size  such  that,  given  the  available  computing  r^ources, 
the  results  are  delivered  within  some  acceptable  time  limit.  Since  we  utilize  colleagues’ 
machines  daring  olf-hours  for  our  workstation  clusters,  typically  our  problems  are  scaled  to 
match  one  of  two  time  slots,  either  the  12  hours  avmlable  daring  a  week  night,  or  the  49 
odd  hours  available  over  a  weekend;  the  simulation  shown  in  Figure  8  was  performed  in  one 
such  weekend  time  slot. 

Despite  our  mispvings,  we  have  performed  the  following  fixed-sized  performance  study. 
Thirty  two  horizontal  stripes,  which  are  stacked  one  above  the  other,  are  used  to  discretize 
a  duct  along  which  a  detonation  wave  is  propagated.  Each  stripe  consists  of  .50  by  4  coarse 
cells  and  contains  one  level  of  adaption,  with  a  refinement  ratio  of  4,  so  as  to  improve 
the  resolution  of  the  detonation  wave.  Tests  were  run  on  a  32  node  iPSC/860  hypercube 
machine,  and  two  workstation  clusters:  one  of  16  Sun  ELCs,  and  one  of  8  SUN  SP.^RClOs; 
both  clusters  used  an  ethernet  ring.  For  the  workstation  trats  we  utilized  the  APPLJ12] 
message  passing  library.  If  there  weie  fewer  nodes  than  strips,  the  strips  were  distributed 
in  blocks.  For  example,  four  neighbouring  stripes  would  be  distributed  on  each  node  of  an 
8  node  cluster. 

For  each  of  the  parallel  environmenls  testeil.  Figure  5  show's  a  plot  of  the  measured 
efficiency  against  the  nutnbeis  of  nodes  used  to  run  the  fixed-sized  problem.  Here  the 
efficiency  is  given  by  the  ratio  of  the  wall  clock  lime  taken  to  execute  the  problem  on  a 
single  node,  to  the  wall  clock  lime  taken  to  execute  the  problem  on  n  processors  scaled 
n  times.  Note  that  for  the  workstation  clusters  there  is  a  sharp  drop  in  efficiency  across 
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the  test  range,  while  for  the  hypercube  the  efficiency  only  drops  below  94%  for  32  nodes. 
Since  the  identical,  load-balanced  problem  was  executed  in  each  case,  in  all  probability,  the 
low  communications  bandwidth  of  the  ethemet  ring  became  saturated  whereas  the  scalable, 
high  bandwidth  network  used  by  the  hypercabe  did  not.  To  test  this  hypothesis,  a  simple 
t^t  program  was  run  which  attempted  to  pass  a  large  number  of  mrylng  sized  messages 
around  a  ring  of  workstations,  as  quickly  as  was  possible.  This  test  represents  a  crude 
approximation  to  the  communication  process  during  the  priming  of  the  dummy  cells.  The 
r^ults  from  this  test  are  shown  in  Figure  6,  they  indicate  that  the  bandwidth  of  our  ethemet 
ring  is  just  1  Mbyte/sec.  The  two  vertical  lines  show  the  range  of  typical  mess<^e  sizes  used 
during  the  priming  process,  from  which  it  can  be  seen  that  the  ethemet  ring  would  indeed 
become  saturated.  Note  that  it  takes  just  two  machine  to  saturate  the  network!  Now  we 
have  observed  a  ten-fold  increase  in  bandwidth  when  switching  from  ethemet  to  fibre  optic 
PDDI.  This  increase  in  bandwidth  would  significantly  delay  the  point  at  which  network 
saturation  tak^  place.  Unfortunately,  our  workstation  clusters  have  not  yet  been  upgraded 
to  take  advantage  of  this  improved  techuolc^. 


Figure  5:  Measured  efficiencies  for  the  fixed-sized  test  problem. 


(a)  Total  Bandwidth 


(b)  Bandwidth  per  Processor 


Figure  6:  Ethernet  bandwidth  as  a  function  of  both  message  size  and  fluster  size. 
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As  an  aside,  it  should  be  noted  that  every  effort  was  made  to  find  the  b^t  set  of 
compile  options  for  a  given  machine  so  as  not  to  artificially  improve  the  performance  figures 
by  needlessly  increasing  the  computation  time  relative  to  the  communication  time.  For 
example,  on  a  Sun  SparcIO  our  selection  of  compile  switches  gives  a  CPU  saving  of  26% 
compared  to  a  naive  optimization  using  just  the  ‘*-03”  compilation  switch. 

Given  the  available  bandwidth,  the  poor  performance  of  the  workstation  clusters  merely 
suggests  that  the  ratio  of  computation  to  communication  was  too  small  for  the  fixed-sized 
problem.  But  this  ratio  was  set  artiacially  small  just  so  the  problem  would  fit  on  a  single 
node  of  the  hypercube.  A  more  realistic  ratio  can  be  obtained  by  running  a  scaled  problem 
were  each  node  proc^ses  at  most  one  stripe;  the  more  nodes,  the  more  strips  that  are 
used  for  the  test.  We  have  carried  out  a  performance  study  for  our  algorithm  on  such  a 
scaled  problem  using  a  stripe  which  consisted  of  one  coarse  m@h  of  100  by  8  cells  that 
contmned  two  levels  of  refinement,  each  with  a  refinement  ratio  of  4.  Figure  7  shows  the 
results  of  this  study.  Here  the  efficiency  is  defined  as  the  ratio  of  the  serial  wall  clock  time 
to  the  parallel  wall  clock  time.  Once  again,  the  algorithm  performs  extremdy  well  on  the 
hypercube.  Note  that  the  efficiency  is  still  above  97%  for  32  nodes.  This  time,  however, 
the  drop-off  in  efficiency  for  the  workstation  clusters  has  reached  acceptable  levels.  For 
example,  either  16  ELCs  or  8  Sparc  10s  may  be  used  at  over  80%  efficiency.  Again  we 
would  like  to  emphasize  that  this  scaled  problem  is  more  representative  of  how  we  use  the 
AMR  algorithm  than  is  the  fixed-sized  problem.  Therefore,  as  will  be  shown  in  the  next 
section,  the  above  efficiencies  are  readily  obtainable  on  real  problems. 


Figure  7:  Measured  efficiencies  for  the  scaled  test  problem. 

In  summary,  the  above  results  shot  that  the  .AMR  algorithm  scales  extremely  well  on 
a  dedicated  parallel  machine  that  offers  a  scalable  data  network.  But  more  importantly, 
perhaps,  the  algorithm  can  deliver  a  good  performance  on  workstation  clusters,  even  when 
communications  are  via  ethernet.  With  the  tenfold  increase  in  bandwidth  observed  with 
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fibre  optic  connections,  it  would  not  be  unreasonable  to  expect  to  be  able  to  use  more  than 
20  machines  at  well  over  80%  efficiency,  provided  that  the  problem  was  reasonably  well  load- 
balanced.  Given  that  there  is  a  trend  towards  vendors  offering  multi-processor  workstations, 
the  effective  number  of  nodes  that  could  be  used  efficiently  would  reach  three  figures.  Fur¬ 
ther  advances  in  network  technology  will  increase  this  figure  still  further.  Certainly,  for  the 
sorts  of  application  for  which  the  .4MR  algorithm  is  intended,  there  is  little  danger  in  the 
foreseeable  future  of  it  performing  poorly  simply  because  of  a  surfeit  of  processing  nodes. 

Before  moving  on  to  the  next  section  it  is  worth  mentioning  one  practical  advantage  of 
our  strategy  of  attacking  the  coarse  grain  parallelism  within  the  AMR  algorithm.  Given 
the  mplexity  of  the  algorithm,  one  can  never  be  sure  that  a  specific  implementation  is 
entirely  fiee  of  bugs.  However,  through  several  years  of  use,  it  is  clear  that  our  serial 
implementation  has  reached  a  stage  whereby  it  might  be  considered  safe  to  assume  that 
any  remmning  bugs  are  benign.  Since  our  parallel  implementation  performs  the  same  set 
of  operations  on  the  same  set  of  data  as  the  serial  version,  there  is  no  reason  why  it  should 
not  ^ve  the  same  results,  even  down  to  round-off  errors.  After  some  vigorous  debugging  we 
can  claim  that  this  is  indeed  the  case.  If  we  had  tackled  a  finer  grain  parallelism  from  the 
outset,  we  probably  would  not  have  been  able  to  utilize  such  a  stringent  quality  control  t^t 
as  checking  for  zero  differences  in  round-off.  Admittedly,  one  expects  a  good  algorithm  to 
be  insensitive  to  round-off  errors,  and  therefore  we  might  well  have  been  content  to  settle 
for  having  just  very  small  discrepancies  between  the  r^ults  produced  by  two  nominally 
equivalent  implementations  of  our  algorithm.  But,  based  upon  previous  experiences,  there 
would  always  remain  a  ni^ng  doubt  that  some  subtle  bug  bad  been  left  uncovered.  We 
are  now  well  placed  to  extend  our  work  by  tackling  the  fine  grain  pardlelism  within  a 
mesh  patch.  Given  the  relative  simplicity,  lo^cally  speaking,  of  the  tasks  performed  within 
a  patch,  there  would  be  no  merit  in  trying  to  maintain  the  same  stringent  control  over 
round-off  errors  for  this  future  work. 


4.2  A  '^Real  World**  Application 

Although  classically  modelled  as  a  quasi  one-duncnsional  structure  consisting  of  an  inert 
shock  front  followed  by  a  reaction  zone,  a  detonation  wave,  in  marked  contrast  to  an  ordinarj* 
shock  wave,  can  be  highly  two-dimensional  in  nature[7];  a  detonation  front  rarely  remains 
planar.  Over  the  years,  experiments  have  revealed  that  a  detonation  front  often  supports 
a  complex  system  of  transverse  waves  which  trace  out  a  *^fish-scale"  pattern,  thus  pving 
rise  to  a  characteristic  length  (the  height  of  one  scale)  known  as  the  detonation  ceU  size. 
Recently,  modern  numerical  schem«  have  been  utilized  in  an  attempt  to  improve  our  general 
understanding  of  this  phenomena[.5,  14].  -Although  such  numerical  simulations  have  hmn 
reasonably  successful,  it  is  clear  that  they  have  a  number  of  shortcomings.  For  example. 
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whereas  most  calculations  have  been  for  fronts  which  arc  just  one  or  two  ceils  in  length,  a 
leaiistically  sized  problem  wouid  necessitate  using  a  front  that  contained  tens  of  cells.  This 
shortcoming  is  an  indication  of  the  e,xpense  of  such  simulations.  Today,  however,  using  our 
parallel  AMR  algorithm  running  on  a  relatively  small  number  of  workstations,  we  are  able 
to  perform  routine  calculations  that  involve  ten  or  more  detonation  cells. 

Figure  8  shows  results  from  one  such  simulation,  a  detonation  wave  reflecting  from  a  20“ 
ramp.  The  left-hand  plate  shows  a  single  Schlieren-type  snapshot  which  clearly  shows  the 
e.xpected  Mach  reflection  flow  pattern,  while  the  right-hand  plate  shows  the  characteristic 
“fish-scale"  pattern  traced  out  by  the  transverse  waves.  This  last  plate  is  rasentially  a  smear 
image  of  the  vorticity  field  as  the  detonation  front  interacts  with  the  ramp,  and  it  is  therefore 
comparable  to  the  soot  trace  records  that  arc  produced  cxperimentally(7];  the  details  for 
how  we  produce  numerical  soot  traces  are  given  elsewherefl.a].  Note  that  the  lighter  a  point 
within  the  image,  the  higher  the  peak  vorticity  that  has  passed  that  point  in  space,  and  so 
the  edges  of  the  “fish-scales"  correspond  to  the  tracks  traced  out  by  the  numerous  triple 
points  along  the  length  of  the  detonation  front.  Also  note  that  the  detonation  celt  size 
behind  the  Mach  stem  is  considerably  smaller  than  that  behind  the  incident  front.  .A^n 
this  is  in  aiccordanro  with  experimental  observation.  Since  the  Mach  stem  moves  normal 
to  the  ramp  surface,  it  must  be  travelling  faster  than  the  incident  front;  thus  the  ratio  of 
its  sp^  to  the  Chaproan-Jonguet  velocity  (the  minimum  sustainable  detonation  speed)  is 
also  higher  than  that  of  the  incident  front.  Generally  speaking,  the  higher  this  ratio,  the 
more  stable  a  detonation  front  becomes  and  so  its  cell  size  decreases. 

Given  the  performance  remits  that  were  prsented  in  the  previous  section,  it  would  serve 
no  purpe^  here  to  pr^nt  further  detailed  figures  for  the  ramp  simulation.  Suffice  it  to 
say,  such  simulations  have  been  run  for  up  to  40  consecutive  hours  on  eight  Sun  SparcIO 
workstations,  during  which  sustained  effidencte  of  over  80%  were  achieved,  despite  the  fact 
that  all  communications  were  via  ethemet  rather  than  fibre  optic  links.  While  we  do  note  a 
few  spedfic  details  as  to  how  the  ramp  calculation  was  performed,  it  should  be  rec^nized 
that  this  calculation  forms  part  of  a  careful  investigation  into  the  Mach  reflection  proc^ 
of  detonation  wav^,  and  therefore  a  full  account  will  appear  in  the  literature  in  due  course. 

The  ramp  calculation  employed  a  four  level  adaptive  grid.  The  coars«t  grid  contained 
400  by  18  cells,  and  a  further  three  levels,  each  with  a  rcfinenicnt  ratio  of  4,  were  used  to 
r^olve  the  details  of  the  flow.  Thus  the  finest  grid  level  is  equivaleat  to  a  uniform  mtsh  of 
256M  by  1 152  cells.  The  so-called  reactive  Euler  equations  were  used  to  model  the  flow|l4). 
In  essence,  a  single  reactant  ,4  is  converted  to  a  single  product  B  by  a  one-step  ifTevcrsible 
chemical  reaction  which  is  governed  by  .Arrhenius  kinetics.  Four  parameters  dictate  the 
basic  behaviour  of  the  detonation  wave.  These  arc:  a  non-dimensional  activation  energy,  E, 
a  non-dimensional  heat  release,  Q,  the  ratio  of  specific  heats,  y,  and  the  d^ree  of  overdrive, 
/,  for  the  detonation  wave.  For  the  case  presented  here,  these  parameters  took  the  values 
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Figure  8:  SdtB^a-type  Inage  aed  nuineti^  soot  tnM  for  a  detOBatktn  wa\%  reflecting 
fronj  a  mnp.  Key:  I^.,  Incident  Shock;  R-S.,  Reflated  Shwk;  M.S.»  Mach  Steia;  T.P., 
Triple  Pnot. 
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For  caption,  sec  page  opposite. 


10, 50, 1.2  and  1.2,  respectively.  The  computation  was  started  using  the  exact,  steady  ZND 
wave  solution[7].  The  resolution  of  the  computational  mesh  may  be  gauged  by  the  fact 
that  the  finest  grid  level  provides  15  mesh  cells  per  reaction  half-length,  Li,  the  distance 
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measured  from  the  detonation  front  by  which  half  the  reactants  are  consumed  in  the  steady 
ZriD  solution.  To  seed  the  transverse  wave  structure,  a  sinusoidal  perturbation  was  added 
to  the  pre-exponential  factor  in  the  Arrhenius  rate  term  for  the  first  few  time  steps.  The 
wavelength  of  the  perturbation  was  chosen  to  be  close  to  the  transverse  spacing  predicted 
by  linear  theory[5];  thus  to  get  10  detonation  cells  along  the  length  of  the  incident  front 
the  channel  width  was  taken  to  be  76.8  reaction  half-lengths.  The  operator  split  version  of 
Roe’s  scheme  which  has  been  described  by  Clarke  et  a/.[6]  was  used  to  integrate  the  flow, 
albeit  with  the  inclusion  of  some  additional  dissipation  so  to  avoid  the  failings  reported  by 
Quirk[14].  The  bulk  of  the  simulation  was  spent  propagating  the  incident  wave  to  the  foot 
of  the  ramp,  a  distance  of  some  15  channel  widths,  so  as  to  be  sure  that  the  detonation 
front  was  well  settled  before  it  interacted  with  the  ramp. 

We  close  this  section  with  a  reminder  that  the  AMR  algorithm  is  not  tied  to  any  one 
application.  Admittedly  the  theme  throughout  this  paper  has  been  one  of  detonation  flows, 
but,  there  is  no  reason  why  our  AMR  machinery  cannot  be  brought  to  bear  on  an  entirely 
different  application. 

5  Conclusions 

A  method  has  been  presented  which  describes  how  Quirk’s  adaptive  mesh  refinement  algo¬ 
rithm  (AMR)  may  be  parallelized.  This  algorithm  is  sufficiently  general  that  the  method  of 
parallelization  can  be  taken  as  a  template  for  a  whole  class  of  block-structured,  mesh  refine¬ 
ment  schemes.  The  method  of  parallelization  exploits  the  natural  coarse  grain  parallelism 
found  in  the  AMR  algorithm  so  as  to  leave  the  original  serial  algorithm  virtually  intact. 
Moreover,  since  it  makes  no  demands  on  the  target  parallel  hardware  other  than  assuming 
some  simple  message  passing  capability,  portability  across  a  range  of  platforms  is  ensured. 
To  date  we  have  demonstrated  that  the  parallel  algorithm  runs  on  both  workstation  clusters 
and  on  an  Intel  hypercube  system. 

The  robustness  of  the  parallel  AMR  algorithm  is  such  that  it  is  now  run  routinely 
on  workstation  clusters  for  large  scale  simulations  of  detonation  phenomena.  A  typical 
calculation  might  utilize  8  SuN  SparcIOs  for  up  to  forty  hours.  Efficiencies  of  over  80%  are 
reached  for  these  computations  even  when  inter-workstation  communication  is  via  ethernet 
rather  than  fibre  optic  links.  Such  high  efficiencies  stem  from  the  block-structured  nature 
of  the  AMR  algorithm;  by  dealing  with  blocks  of  cells  rather  than  individual  cells  one 
markedly  improves  the  ratio  of  computation  to  communication,  thus  aUeviating  some  of  the 
performance  problems  associated  with  using  low  speed  networks. 
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Thus  far  we  have  circumvented  the  thorny  issue  of  load  balancing.  For  our  detonation 
simulations,  although  the  computational  grid  is  adapting  to  the  flow  solution,  it  does  so 
in  a  fashion  which  allows  the  load  balancing  to  be  fixed  as  a  one-off  at  the  start  of  the 
calculation.  At  this  juncture,  considering  the  large  number  of  adaptions  that  take  place  in 
a  typical  simulation,  we  are  not  hopeful  that  a  general  purpose  load  balancing  procedure 
will  be  found  that  is  cheap  enough  to  be  used  during  the  cour.se  of  a  calculation.  Instead, 
we  envisage  employing  heuristic  procedures  much  as  is  done  with  mesh  refinement  monitor 
functions.  Experience  shows  that  such  refinement  functions  work  well  in  practice,  .so  there 
is  no  reason  to  believe  that  heuristic  load  bt-ilanciiig  functions  won't  also  work  well.  Further 
work  is  planned  to  see  whether  or  not  this  optimism  is  misplaced. 

Lastly,  it  is  worth  noting  that  workstation  clusters  provide  a  relatively  robust  and  cheap 
environment  in  which  to  develop  parallel  algorithms.  While  large  purpose-built  parallel 
machines  may  offer  unrivaled  computational  power,  they  do  not  always  come  complete  with 
stable  operating  systems!  We  were  able  to  design  and  test  the  ])arallcl  .AMR  algorithm 
on  a  reliable  system  of  workstations.  Then,  safe  in  the  knowledge  that  the  algorithm  was 
working  satisfactorily,  it  took  less  than  one  mornings  work  to  port  the  code  to  the  Intel 
hypercube.  It  would  doubtless  have  been  a  more  traumatic  exercise  to  develop  the  parallel 
AMR  algorithm  directly  on  the  hypercube. 
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